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\  ABSTRACT 

This  paper  presents  a  method  for  calculating  the  likelihood  function  of 
autoregressive-moving  average  (APMA)  models  for  time  series  data.  Model 
estimation  requires  maximization  of  the  likelihood,  and  to  assist  in  this,  a 
method  for  calculating  derivatives  of  the  function  is  also  presented.  The 
computational  efficiency  is  competitive  with  that  of  other  algorithms  for  this 
purpose.  Extensions  which  allow  for  seasonal  models,  missing  data,  and  the 
estimation  of  a  data  transformation  are  also  described.  - 


AMS  (MOS)  Subject  Classifications:  62M10,  65U05 

Key  Words:  Autoregressive-moving  average  model:  Time  series  estimation;  Exact 
likelihood 

Work  Unit  Number  4  (Statistics  and  Probability) 
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SIGNIFICANCE  AND  EXPLANATION 


This  paper  is  concerned  with  the  statistical  analysis  of  time  series, 
i.e.  records  of  observations  of  variables  in  fields  as  diverse  as  economics, 
engineering,  meteorology  ...  .  The  class  of  ARMA  models  has  been  proved  to  be 
most  successful  in  representing  the  dynamic  structure  of  such  series  and  are 
useful  in  both  prediction  and  control,  and  in  the  investigation  of 
relationships  between  series.  The  models  are  in  the  form  of  discrete  time 
difference  equations  relating  present  values  of  the  series  to  past  values, 
with  a  prediction  error  term.  It  is  important  to  have  precise  tools  for  the 
estimation  of  the  coefficients  in  these  equations,  and  for  discriminating 
between  different  models.  Computation  of  the  exact  likelihood  of  a  model  is 
important  in  this  respect,  and  so  is  estimation  of  the  coefficients  by 
maximizing  the  likelihood.  Several  algorithms  for  computing  the  likelihood 
are  now  available,  and  the  one  presented  in  this  paper  is  highly  competitive 
on  the  grounds  of  numerical  accuracy  and  efficiency.  It  has  been  implemented 
and  proved  useful  for  modelling  a  wide  variety  of  data. 
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THE  ESTIMATION  OF  TIME  SERIES  MODELS.  PART  I.  YET  ANOTHER  ALGORITHM 
FOR  THE  EXACT  LIKELIHOOD  OF  ARMA  MODELS 
G.  Tunnlclif fe-Wilson 


1.  INTRODUCTION 

The  basic  autoregressive-moving  average  model  of  orders  p  and  q,  or  ARMA  (p,q) 

model,  for  a  time  series  xt,  is  defined  by 

■  A  v  ♦  •••  ♦  4  v  ♦  a..  -  B.a  — •••-0a  (1.1) 

"t  Tfl  *p  t-p  t  1  t-1  q  t-q 

where  wfc  »  xt  -  c,  the  constant  c  representing  the  expectation  of  xt.  The  unobserved 

series  a^  is  assumed  to  be  a  sequence  of  independent  random  variables  from  a 
2 

Normal  (0,o  )  distribution.  In  terms  of  the  backward  shift  operator  notation  of  Box  and 

Jenkins  (1976)  ths  model  is  conveniently  written  8(B)w^  “  0(B)at  where 

♦  (B)  “1  -  ♦  R  -  •  •  •  -  ♦  Bp  and  8(B)  »  1  -  e.B  -  •••  -  8  b1*.  The  coefficients  of  the 
i  p  I  q 

model  are  constrained  by  a  statlonarity  condition  that  8(8)  *  D  for  R  «  1.  This 
ensures  that  the  process  reaches  statistical  equilibrium  or  stationarity .  A  further 


condition  is  that  8(B)  *  0  for  |B|  <  1,  which  allows  identification  of  afc  with  the 
linear  innovations  (or  one  step  ahead  prediction  errors)  of  the  process,  i.e. 
afc  »  xfc  -  E(xtlxt_1,xt_2...).  This  includes  the  borderline  case  when  8(B)  may  have 
zeros  on  |B|  •  1. 


Given  a  finite  sample  x1(...,xn  (and  assuming  the  process  has  reached 

2 

statlonarity),  the  likelihood  of  the  model  parameters  S  "  (c,o  ,♦,,...,♦  ,8 ...... 8  )  may 

i  p  '  q 

be  formally  presented  using  the  covariance  matrix  T  of  the  sample,  which  has  elements 

2 

r  ”  covlx^x^)  ■  We  may  set  »  a  V ^  where  V  depends  only  on 

♦  "  (♦,  •••  ♦  )  and  8  «  (8  •••  8  ).  Then  the  likelihood  is 

1  P  1  <1  1 

LIB)  •  o_nM'  exp(- V2C/o2) 

where  Q  «  w'v'^w  is  a  quadratic  form  in  w  ■  (w,  •••  wn)’  and  M  »  det  V. 
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Whan  maximized  w.r.t.  o*,  L  “  D-"^2  where  D  •  M1,/nQ.  The  K.L.I.s  of  c,8,8  aay 
then  he  found  by  minimizing  D,  which  we  ehali  call  the  deviance.  The  factor  M(8,6,n) 
doe a  not  depend  on  the  data,  and  tends  to  a  limit  M(8,8)  as  n  ♦  •  .  Consequently 

♦  1  for  eny  fixed  8  and  6,  and  is  often  neglected  in  large  samples,  but  is 
Important  when  n  is  small,  or  8,8  are  close  to  their  constraint  boundaries. 

The  form  Q  may  be  evaluated  by  many  devices,  most  of  which  involve  a  sum  of  squares 

2 

function  roughly  of  the  form  p  *  I  a£,  where  the  terms  at  are  regenerated  from  the 
data  via  the  modal  (1.1).  Lack  of  knowledge  of  xfc  for  t  <  0  (the  end  effect)  however, 
prevents  exact  evaluation  of  a 1  •••  an>  so  they  are  estimated  by  various  means  ranging 
f re*  the  inqenioue  beckforecasting  scheme  of  Box  and  3enkins  ( 1976)  to  the  generel  methods 
of  Newbold  (1974)  and  Ljung  and  Box  (1979).  More  recently  p  has  been  calculated  as  a 
weighted  sum  of  squares  of  Innovations  based  on  the  finite  data  set, 
bfc  ■  xt  -  E(xtlxt_1  •••  x,),  so  that 

.  2 

n  b  n 

«>  -  Hr)  * M  • "  ht  • 

i  t  i 

Although  afc  -  bfc  ♦  0  and  hfc  ♦  1  as  t  increases,  the  residuals  may  differ 

2  2 

appreciably  for  small  values  of  t,  e.g.  b,  ”  x1  and  h1  »  var(xt)/a  always.  The 

projection  techniques  for  constructing  bfc  may  be  based  on  direct  use  of  Cholesky 
factorisation  as  in  Ansley  (1979)  or  on  the  Kalman  filter  as  in  Gardner  et  al.  (1980). 

They  are  computationally  efficient,  and  by  exploiting  the  special  structure  of  the  ARM A 
model  within  the  Kalman  filter,  Milard  (1903)  has  produced  an  algorithm  which  is  extremely 
economical  in  its  use  of  arithmetical  operations  and  core  storage.  The  algorithm 
presented  here  follows  more  closely  the  classical  approach  of  Ljung  and  Box  but  exploits 
simple  special  methods  which  have  been  long  known  for  pure  AR(p)  and  MA(q)  models. 

This  simplicity  allows  convenient  analytic  computation  of  the  derivatives  of  the  deviance 


for  us*  in  minimisation  routines.  Not  only  is  this  more  accurst*  than  the  us*  of 


numerical  derivatives,  it  may  also  b*  computationally  much  cheaper  when  the  orders 
are  greater  than  one. 


p.q 


3- 


2.  DEVELOPMENT  OP  THE  ALGORITHM 


It  ia  convenient  to  reproduce  first  tha  algorithms  for  tha  cases  of  tha  pure  AR(p) 
and  MA(q)  models  since  these  are  the  basis  for  tha  general  ARMA(p,q)  case. 

For  the  AR(p)  model  define  series 


ft 

<♦ 

1 

s 

1 

4  w  -•••-Aw  t 

M  t-1  Vc-p 

t 

a  1  • ••  n 

bt  -  wt  - 

Aw  —•••—Aw  ■ 

*1  t+1  P  t+p 

t 

*  (1  -  p) 

where  the  values  of  w  for 

t  <  0  are  all  taken  as 

0. 

Then 

n 

y. 

t-i 


0 

I 

t“1-p 


The  factor  m  remains  constant  for  n  >  p  in  this  case,  so  is  found  from  the  case 
n  “  p  by  extracting  from  Q  the  elements  of  V-1  »  p  as 

p-i  p-t 

Fij  “  J_0  Vk+t  -  Jij  Wt  for  i'i  -  1  •••  p  ««  A  >  3  ' 

where  f Q  «  1 ,  f «  -4^  for  k  *  1  • • •  p  and  *  ■  i  -  j . 

M  may  then  be  obtained  as  1/det  P,  though  we  shortly  provide  a  more  efficient 
procedure  for  its  calculation. 

The  above  expression  for  Q  is  to  be  found  in  Ljung  and  Box  (1979)  equn.  (4.4)  and 
is  derived  by  exploiting  the  time  reversal  symmetry  of  the  model  as  in  Bov  and  Jenkins 
(1976),  Appendix  A7.5. 

The  calculation  of  det  F  follows  a  scheme  based  on  the  reverse  of  the  Durbin- 

Levinson  algorithm  as  presented  by  Tunniclif fe-Wilson  (1979),  p.  303.  It  is  based  on  the 

reduction  of  fQ  •••  f  to  a  new  set,  say  f^  •••  f^,  where  p*  *  p  -  1,  by  fj  ■  1» 

fl  ■  (f,  -  f.  f.  ,  )/T  for  J  “  1  •••  p'  where  T  »  (1  -  fj;).  This  is  repeated  with 

3  3k  k-j  P  P  P 

f',P*  replacing  f,p  until  p'  “  1.  Then 


det  r 


1  1  T 
k-1 


(2.1) 
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.  -e*.  *■- 


This  also  provides  a  check  on  the  stationarity  condition  which  is  satisfied  iff 

>  0  -  see  Duffin  (1969).  For  the  MA(q)  model,  the  first  step  is  to  introduce 

variables  (a^^,  •  •  •  ,afl  )  •  a£  corresponding  to  the  pre-observation  period  innovations. 

If  these  are  supplied,  the  remaining  set  (a,  •••  an>  -  a^  may  be  regenerated  by 

a  -  w  +  6a  +  •••  +  9  a  ,  t  -  1  •••  n  (2.2) 

t  t  1  t-1  q  t-q 

and  a  sum  of  squares  calculated: 


S(aL>  “  I  *1  • 

,_q 


Then  Q  "  min  S.  The  minimizing  set  a  may  be  found  by  linear  least  squares  and  M 

L 

obtained  as  the  determinant  of  the  matrix  in  the  least  squares  equations.  Details  are 
given  in  Box  and  Jenkins  (1976)  Appendix  A7.4.  Following  ideas  in  Ljung  and  Box  (1979)  it 
is  convenient  to  use  the  backforecasts  (w^  ,...,Wq)  -  w^  as  an  equivalent  set  of 
variables  in  place  of  a^.  The  least  squares  equations  are  then  of  the  form 

X  *Xw  «  X'a 

L 

where  a*  »  (a£,a^)  with  a^  *  0,  and  X^  ■  * l-j'  the  se^uence  being  the  model 

(-weights.  The  Jacobian  between  aL  and  w^  is  1,  so  again  M  -  det(X'X). 

The  combined  algorithm  for  the  mixed  model,  ARMA(p,q),  again  requires  the 


introduction  of  the  set  w^,  which  is  to  be  estimated  by  w^.  In  this  process  it  is 

natural  initially  to  set  w^  »  0,  but  because  estimation  of  the  model  parameters  B  is 

Iterative,  it  is  convenient  to  initialize  to  the  value  w^  from  the  previous 

Iteration  and  then  to  estimate  a  correction  £w  .  Thus  we  take  w.  to  be  set  to  some 

L  I* 

initial  value,  not  necessarily  0.  The  procedure  is  then  to  regenerate 


) 


-  Vt.t 


-  ♦  w  +8.b 
p  t+p  1  t-1 


+  0  b  , 
q  t-q 


(1-p-q) 


taking  wfc  »  afc  *  bfc  »  0  for  t  <  1  -  q. 


-5- 


Setting  (a 


a  ate.  we  shall  rapreaant  (2.3)  by 


a  )' 
n 


♦  (B) 
6(B) 


«  (B)wf 


♦  (B~'l 
8(B) 


X  (B)w 


(2.4) 


tha  vactora  being  of  length*  n  ♦  q,p  respectively  in  these  two  equations.  The  1  sub  of 
squares*  function  is  then 


S(wL)  -  J  a*  -  ,  (2.5) 

t”  1  -q  t“ 1 -p-q 

and  again,  Q  «  min  S  •  S(w  ).  “Ms  minimization  requires  calculation  of  the  coefficients 

w  L 

L 

(»,...,«  ,)  "  »*  and  (x  ,.*.,X  ,>  “  X*»  by  applying  the  same  equations  (2.3)  to 

u  nt-q- 1  -p  "  i 

an  impulse  sequence  T  •  (1,0,0...)  in  place  of  w.  Thus 

*  -  *(B)Ij  x  ”  X ( B) I  .  (2.6) 

The  estimate  w^  is  ♦  Sv^,  where  4w^  is  the  solution  of  Afiw^  »  -G,  the 
terms  in  this  equation  being  given  by 


n*q 


■«*«-» '  J,  . . . 

n+q  0 

Gi  *  l  ¥t-i*t-a  '  $  Xt-ibt-o!  1  “ 

t«1  C  1  C  q  t-1-p  C  1  *  q 


(2.7) 


(2.8) 


The  minimum  value  Q  may  be  obtained  as  S(w  ),  or  as 

L 

Q  -  S(wL)  -  («**l>’G  . 

The  factor  M  in  the  likelihood  is  then  given  by  M  =  det  A/det  F  where  F  is  defined 
from  the  Autoregressive  parameters  as  for  the  pure  AR(p)  model. 


An  example  -  the  ARMA(1,1)  model. 

Taking  wQ  “  0  calculate  afc  =  wfc  -  $w  ^  +  8a  ^»t«1»««n  and  b_^  “  0.  Also 

n 

»0  "!»»,*(♦“  8)  »n+1  =  (♦  -  0)9°,  and  x_1  “  "♦•  Then  S(w„)  -  J  a*. 


a  -  i  ♦  (♦  -  e)2o  ♦  e2  ♦  ♦  e2")  -  ♦*.  9  *  <♦  -  «>  I  6t_1a  • 

i 

w„  -  -9/A,  Q  -  S(wQ)  -  wQg  . 

2 

r  -  1  -  ♦  and  on  simplifying  , 

f  (4  -  8)2o  -  e2n)/{(i  -  42hi  -  e2»  if  in 


m  -  A/r  -  i  + 


[n(1  -  4)/d  ♦  #)  if  8  -  1  . 

Verification  of  the  algorithm.  Express  the  model  (1.1)  using  an  intermediate  series  , 
as  w..  **  8(B)v  where  4(B)v  *  a.  .  Take  as  before,  v  »  (v  ,...,vft), 

t  t  t  t  i*  q  v 

vR  *  (V|,>..,v  ).  Then  we  have  the  MVN  pdf 

f(v)  -  f(vL,vR)  «  o  "det  F  ^  exp-1/2  S/o2  (2.9) 

where  S  «  I  a^  -  E  with  at«bt  derived  froei  v  as  for  the  AR(p)  model. 

Defining  now  (wL,wR)  ”  8(b)(vl,vr),  we  have  wR  as  precisely  the  set  of  observed 
series  values,  whereas  wL  must  be  considered  as  merely  a  linear  transformation  of  v^ 
and  not  as  a  set  of  true  series  values,  because  in  its  definition  vfc  is  set  to  0  for 
t  <  1  -  q.  Hie  Jacobian  of  all  these  transformations  is  1,  so  (2.9)  is  also  the  MVN 
pdf  f(w£,wR).  Also,  S  as  a  function  of  w  corresponds  exactly  to  the  constructions 
(2.3)  and  (2.5).  By  the  properties  of  the  MVN  distribution,  any  set  of  variables  (in 
this  case  w^)  may  be  integrated  out  by  minimizing  the  exponential  term  S,  which  in 
effect  expresses 

S<Wl )  ’  0  +  <*1,  -  -l>A(wl  -  wL) 

where  Q  »  Slw^).  The  usual  algebra  then  gives  the  pdf  f(wR)  by  replacing  S  by  0  in 
(2.9)  and  introducing  the  factor  1/det  A  ^2  to  give  the  stated  form  of  M. 


-7- 


3.  COMPUTATIONAL  REQUIREMENTS 


We  present  the  dominant  terns  in  the  number  of  multiplications  (mults)  required  for 

calculating  the  deviance.  It  is  worthwhile  considering  here  the  possibility  of  'sparse* 

12 

seasonal  models  e.g.  the  multiplicative  operator  (1  -  9b) (1  -  8B  )  used  in  the  airline 
model.  We  shall  use  q  for  the  total  number  of  coefficients  in  such  a  moving  average 

operator,  e.g.  two  in  this  case,  and  s  for  the  maximum  order,  e.g.  13  here.  We  use  p 

in  a  similar  manner  to  q.  Thus  to  calculate  one  term  afc  needs  p  +  q  mults,  but  we 
need  s  'backforecasts'  wL*  Most  of  the  calculations  for  the  series  bt  are  small 
compared  with  those  for  afc  so  are  left  out.  Similarly  we  shall  use  n  where  in  some 

cases  we  should  strictly  use  n  +  q  e.g.  for  the  length  of  at* 

Step  (2.3)  to  regenerate  afc  requires  n(p  +  q)  mults. 

Step  (2.6)  to  obtain  x  requires  nq  since  for  i  >  p  the  AR  terms  are  not  required. 

Step  (2.7)  to  set  up  A  requires  ^  nr ^  mults  as  it  stands  and  exploiting  symmetry. 

However,  following  Ljung  and  Box  there  are  two  possible  economies.  Computing  the  first 

column  of  A,  i.e.  Ai  i»  ^  “  1  ***  8  requires  only  ns  mults  and  other  entries  may  be 

obtained  by  A,  A,  .  -  x  x  ,  requiring  few  mults.  Furthermore  the  first 

i+l,j+l  i,3  n+s-i  n+s-3 

column  in  A  may  itself  be  found  as  the  first  s  value  in  the  sequence  determined  as 
x (B  ')x.  This  has  an  advantage  for  seasonal  models  of  requiring  only  nq  mults,  if 

organised  carefully.  Similarly,  by  constructing  x(B  'la  the  elements  of  G  can  be 

obtained  in  nq  rather  than  ns  mults.  To  apply  x(B  ')  efficiently  in  this  case  would 
be  to  compute  e.g. 


“  a  t  9  e  t  ...  t  9  e  , 
t  t  1  t+1  q  ttq 


•  •  1  - 


taking  et  •  0  for  t  >  n,  then 

Gi  *  *i-q  -  ♦iVt-q  - 

Solution  of  the  equations  and  evaluation  of  det  A,  may  be  performed  using  a 

3  1  2 

Cholesky  decomposition  with  (1/6)s  mults.  Computation  of  det  F  requires  73  p  mults 


P  e  ,  i  »  1  "..  s 
P  i+p-q 


which  we  neglect,  giving  a  minimum  number  of  n(p  +  4q)  t  (1/6)sJ  mults. 


and  associated  with  it  is  a  storage 


The  main  penalty  here  is  the  term  in  s^, 

requirement  of  /jj  real  numbers  for  A.  for  many  applications  this  is  no  burden, 

including  seasonal  models  with  typical  values  of  s  *  13.  However,  in  some  large 

organizations  models  of  hourly  data  with  a  seasonal  period  of  one  week  are  routinely 

fitted  (using  at  present  the  classical  backforecasting  scheme  of  Box  and  Jenkins)  and  here 

a  value  of  s  m  168  +  24  +  1  “  193  is  common.  In  this  context  the  method  of  M^lard  would 

be  expected  to  have  a  distinct  advantage.  The  special  structure  of  A  does  give  some 

hope  of  treatment  by  special  means  such  as  have  been  presented  by  Dickinson  (1978)  which 

2 

may  reduce  the  number  of  multiplications  to  0(s  )  and  storage  to  0(s). 

For  a  pure  moving  average  multiplicative  seasonal  model,  Hillmer  and  Tiao  (1979)  also 
provide  a  scheme  which  reduces  the  number  of  operations  in  this  stage  to  O(q^). 

In  circumstances  where  the  sequence  »  can  be  considered  to  decay  to  0  before 
t  =  n,  it  is  possible  to  use  the  classical  backforecasting  scheme  to  generate  wf ,  and 

the  method  of  McLeod  (1977)  to  determine  the  factor  M.  The  formula  (2.1)  may  be 

2 

exploited  in  McLeod's  procedure,  so  that  the  number  of  computations  is  0(p  +  q)  . 

Given  that  much  of  the  execution  of  a  statistical  package  is  occupied  with 
organizational  aspects,  computational  efficiency  in  the  main  algorithm  is  not  always 
vital.  The  advent  of  parallel  processing  may  favor  some  algorithms  which  can  exploit  this 
feature,  and  which  were  previously  uncompetitive.  Nevertheless,  problems  are  bound  to 
arise  which  stretch  the  resources  available,  and  effort  put  into  algorithmic  efficiency  is 


then  appreciated 


4.  COHPUTATION  OF  DERIVATIVES 

The  derivative  of  D  wrt  the  model  parametera  6  may  be  expressed  aa 
M1/n{<0/n)O/36)(log  det  A  -  log  det  F)  ♦  3Q/3B)  -  M1/nG(6) 
where,  now  introducing  B  as  a  parameter, 

C(B)  -  min  S(w  ,B)  -  S(w<6),8)  . 

L  Li 


(4.1) 


30  3S  *  3S  *  ®a* 

From  this,  r?  “  —  (w  ,8),  since  —  -  0  at  w  .  Now  writing  e.g.  aBt  for 

0  P  OP  Li  dW  L  t  dp 


have  simply 


—  -  2(1  ataBfc  -  £  ^bB^  ,  (4.2) 

where  the  summations  run  as  in  (2.5).  He  now  turn  to  the  computation  of  the  sequences 
aB.bB  and  use  operator  notation  such  as  in  (2.4)  to  represent  equations  such  as  (2.3) 
which  are  actually  used  for  the  computation.  Recall  that  when  such  equations  are  used  all 
variables  associated  with  times  before  the  first  point  in  the  sequence  are  taken  as  0. 

For  the  model  constant  term  c,  ac  »  -»(B)u  where  u  is  a  sequence  of  ones  from 
t  =  1,  with  zeros  for  t  <  0.  In  fact  act  “  act_i  “  *fc  1  for  t  >  1,  which  saves  on 
computation.  (Note  that  for  the  sequence  *,  our  indexing  convention  is  chosen  to  agree 
with  the  notation  of  Box  and  Jenkins,  so  that  the  first  term  in  the  sequence  *  is  Kg, 
even  though  it  is  associated  with  the  first  time  point  t  «  1  -  q.) 

For  all  the  autoregressive  parameters  we  need  generate  only  two  series 

a*  -  -1/8(B)w,  b$  -  -B~P/8(B)w  (4.3) 

from  which  may  be  obtained  all  the  required  derivatives 

3at/9*k  -  aW  3bt/3*k  ■  ^t-(p-k)  •  (4-4) 

The  shift  B  p  in  the  operator  generating  b$  is  a  formality  to  ensure  merely  that  we 

can  continue  to  use  the  span  of  p  values  b*  •••  b*  which  would  otherwise 

1-p-q  -q 

contain  zeros. 

For  all  the  moving  average  parameters  we  generate  series 

a8  =  1  /9 ( B ) a ,  b9  «  1/8(B)b  (4.5) 
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from  which 


*at/a9k  "  *®t-k '  3bt/a8k  ”  b6t-k  *  <4.6  > 

thus  all  the  derivatives  of  S  can  be  constructed  with  a  further  n(p  +  2q)  mults 

including  accumulation  of  the  products,  if  (4.3)  Is  used  as  first  step  in  constructing 


Turning  now  to  O/30)log  det  A  •  tr(A  ^A/SB),  the  matrix  A  1  is  obtained  as  a  by 

product  of  solving  the  equations  for  Sw  ,  with  relatively  few  further  operations.  The 

L 


elements  of  3A/30  are  of  the  form 


3Aij/96k  ’  2<t  Vi^t-j-k  -  1  Vi^t-j-k1 


where  for  the  autoregressive  parameters  4y , 


*4  -  -1/8<b)I  and  x4  “  -B~P/8(B)I 


(4.7) 


(4.8) 


and  k  is  replaced  by  p  -  k  in  the  second  sum  of  (4.7),  as  in  (4  4).  For  the  moving 
average  parameters. 


w8  -  1/8(B)»  and  x®  “  1/8(B>x  • 

Thus  only  two  new  aeries  of  length  n  need  to  be  constructed,  and  the  sums  in  (4.7) 
may  again  be  formed  using  the  operator  *(B  %  as  in  the  construction  of  A,  with  as  few 

1  2 

as  2nq  further  mults.  Finally  the  trace  terms  may  be  accumulated  with  —  s  (p  ♦  q) 
mults. 

The  derivative  of  log  det  F  is  tr(F  ^SF/JB)  and  is  required  only  for  the  AR 
parameters.  In  fact  f'1  •  V  which  has  elements  »  vp  |  ^  j  for  i,j  »  1  •••  p, 

where  v  is  the  variance  of  a  series  following  the  AR(p)  model  with  o2  *  1,  and 
is  the  acf  of  the  series.  These  may  all  be  rapidly  constructed  as  described  by 
Tunnicliffe-Wilson  (1979),  following  the  construction  of  det  F  in  (2.1).  Then  the 
required  derivative  wrt  4^  is,  after  some  simplification,  given  by 


hk  ■  *2v  jQ  <P'  -  1  -  k,fiP|k-i| 


(4.9) 
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where  p'  ■  p  -  1.  This  agrees  with  tha  formula  obtained  from  (A7.5.16)  and  (A7.5.17)  in 
Box  and  Jenkins  (1976)  if  tha  former  ia  used  to  eliminate  P . .  .. 

|k-p[ 

In  total  therefore,  some  n(p  ♦  4q)  +  V^s  tp  ♦  q)  mult*  provide  all  the  first 
derivatives  of  the  likelihood,  barely  more  than  is  necessary  to  compute  the  likelihood 
itself. 


Hessian  approximation. 

It  is  possible  to  obtain  an  approximation  to  the  second  derivatives  of  D,  which  is 

in  many  cases  adequate  for  use  in  iterative  minimisation  schemes  similar  to  those  of 

Marquardt  (1963).  The  first  step  in  this  approximation  is  to  retain  only  the  term 

M1^n32Q/36^38^  on  dlfferentiatinq  D  -  t4^/>nQ.  This  is  reasonable  at  points  away  from 

the  boundary  of  the  parameter  space  determined  by  the  stationarity  and  invertibility 

conditions,  because  then  M  ~  1  and  can  be  treated  as  a  constant.  To  determine  the 

2 

matrix  Q68,  say  with  elements  3  0/36^38^,  we  use  the  corresponding  second  derivative 
matrix  of  SIw^.B)  partitioned  according  to  the  two  parameter  group*  «L  and  8  as 


Ml 


"21 


M2 


22 


Provided  H  is  evaluated  at  wT(8),  it  is  true  that  QB8 

L 


22 


H21H11H12*  It  is  not 


necessary  to  evaluate  this  explicitly  if  it  is  desired  to  use  a  quasi-Newton  method  to 

calculate  a  parameter  correction  86.  Setting  5'  »  (8w',SB’)  and  G'  -  (0',G(8)'),  it 

is  sufficient  to  solve  h8  ■  -G.  The  rero  in  G  comes  from  the  derivative  of  S  wrt 

wT .  The  correction  term  8w.  need  not  be  saved,  or  even  evaluated,  because  after  the  new 

parameter  set  B  ♦  88  is  constructed,  w  will  be  redetermined  at  this  new  set  at  the 

L 

start  of  the  next  Iteration. 

A  further  approximation  is  in  the  evaluation  of 


H  ■  2{(E  *t,i\,j  ♦ 1  •ts.ij’  •  <E  bt,ibt.j +  E  VUi” 


(4.10) 


The  second  sum  in  each  bracket  is  discarded  in  the  approximation,  where  e.g.  a.  . 
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represents  the  derivative  of  at  with  respect  to  the  l'th  parameter  In  the  set  (w^,8  > . 

If  at  is  linear  in  both  parameters,  at  ij  is  in  ®ny  case  0,  and  close  to  the  origin 
any  nonlinearity  in  the  parameters  8  Is  not  great.  Furthermore,  at  the  true  parameter 
values  the  second  sum  has  expectation  zero  and  magnitude  0(/n)  compared  with  a  magnitude 
of  0(n)  for  the  first  term. 

It  is  convenient  if  H  is  positive  definite.  The  above  approximation  ensures  this 
provided  only  that  the  model  contains  no  AD  parameters.  If  AR  parameters  are  present 
an  adequate  remedy  is  to  start  the  summation  of  the  'a'  terms  from  t  »  (1  -  q  +  p)  and 
to  drop  the  summation  in  the  ’b'  terms  whenever  the  derivative  is  wrt  to  at  least  one 
AR  parameter  (but  not  otherwise). 

Constraints  on  the  magnitude  of  the  parameter  correction  58  may  be  obtained  by 
adding  a  diagonal  matrix  A  to  H22*  And  adjusting  A  using  strategies  similar  to  those 
of  Marquardt  (1963)  to  ensure  a  decrease  in  the  deviance  at  successive  iterates.  The 
parameters  may  be  kept  within  their  constraints  by  this  device.  As  the  iterations 
converge  the  matrix  H  will  also  converge,  and  it  may  not  be  worthwhile  re-evaluating 
it.  From  the  Cholesky  factorization  of  H  the  matrix  L  »  QB8  1  may  then  be  evaluated 
and  used  unaltered  to  obtain  58  ■  -LG(8)  in  following  iterations.  A  further  possibility 
from  this  point  is  to  update  L  using  a  variable  metric  scheme  which  incorporates 
information  from  the  values  of  G(B)  at  successive  iterations.  A  library  optimization 
routine  might  be  used  for  this  purpose.  This  may  be  particularly  useful  when  the  minimum 
deviance  estimates  8  lie  close  to  the  boundary  of  the  parameter  space.  Experience  shows 
that  the  deviance  can  be  remarkably  flat  in  this  region  for  some  MA  models,  and  iterates 
can  be  slow  to  converge  if  high  precision  estimates  are  required. 

It  should  be  noted  that  all  the  terms  a,., a,.  ,  etc.  used  to  construct  the 

approximation  to  H  have  already  been  calculated  to  construct  the  deviance  and  its  first 

derivative.  Indeed,  H..  *  2A  as  used  to  obtain  w  .  Approximately  n(4q  +  2p)  further 

L 

mults  are  required  to  complete  H.  To  solve  the  equations  for  58  will  require 


approximately  (1/6) (a  ♦  p  *  q)  multa •  Hi*  computational  burden  la  therafore  once  more 

comparable  with  the  calculation  of  the  deviance  itaelf. 

Upon  convergence,  the  matrix  L  may  be  used  to  obtain  an  approximation  to  the 

*2 

diaperaion  matrix  C  of  the  parameter  eatimatea,  as  C  “  2lo  ,  where 
*2 

o  *  Q/( n  -  1  -  p  -  q).  The  factor  2  occurs  explicitly  in  all  the  derivative  formula  and 
because  it  cancels,  can  in  practice  be  omitted  throughout,  as  can  the  factor 


i 


> 


S.  SOME  FURTHER  CONSIDERATIONS 


Multiplicative  Operator*.  The  whole  of  tha  foregoing  algorithm  ia  eaaily  modified  to 

12 

allow  for  multiplicative  operator*  auch  aa  (1  -  80)  <1  -60  )  which  appear  in  aeaaonal 

models.  The  main  point  her*  ia  that  there  ia  no  necessity  to  multiply  out  the  operators 

12  -1 

in  order  to  apply  them,  e.g.  to  derive  *t  “  {1  -  60) (1  -  80  )}  wfc  it  ia  sufficient  to 

use 


\  "  “t  +  ®“t-«' 


at  -  «t  ♦  9Vl  • 


Furthermore,  the  derivatives  of  at  wrt  just  one  of  these  parameters  is  again 
simple  to  obtain,  e.g.  3*t/36  -  -1/8(0)*t  as  before.  The  use  of  multiplicative 
operators  is  also  useful  in  non-seasonal  contexts,  particularly  where  quasi-cyclic 
patterns  in  data  are  represented  by  AR  operators  of  order  two  with  complex  roots.  Where 
more  than  one  such  cycle  is  present,  a  second  order  operator  can  be  associated  with  each, 
as  an  explicit  factor  in  the  full  AR  operator.  Suppose  we  have  two  such  factors,  so 
that  e.g.  afc  »  ♦1 (B)62 (B)/0 (Bjw^,  then  we  use  simply  the  series 
3at/38t  -  -♦2<B)/6(B)wt  for  the  calculation  of  derivatives.  In  order  to  obtain  det  F 
in  this  case,  the  coefficients  f  -  <<o'f  1' *  *  *  ^pj+Pj’  can  **•  9*"*rated  applying 


f  -  ^1(B)f2(B)l  where  I  •  (1,0,...).  The  derivatives  of  log  det  F  wrt  the 


coefficients  ,  •••  ♦  ,  p  of  8,(8)  are  h,  -  (h^ ,, . . .  ,h,  ^ )  -  62<b  )h,  where 

h  -  (h  vh2' "  * 'hp1+p2>  ar*  aa  c*lculate<il  ln  with  p’  “  P1  ♦  p2  -  1  and  pfc 


derived  from  f. 

Missing  data.  Methods  based  on  the  Kalman  filter  ere  well  suited  to  handling  missing 
data.  Provided  there  is  not  a  very  large  number  of  missing  values,  the  algorithm 
presented  here  can  be  used  with  reasonable  efficiency,  and  certainly  with  convenience.  A 


missing  value  at  time  T,  say  w^  is  appended  to  the  set  w^  and  treated  in  exactly  the 


same  way.  The  derivatives  are  extracted  from  the  same  *  sequence,  and  used  in  the 
minimization  of  S  wrt  the  extended  set,  to  yield  P  and  det  A.  Note  that  A  is  also 
extended  in  size  by  these  missing  values,  and  this  treatment  supplies  both  the  estimates 
of  the  missing  values  and  the  exact  likelihood  of  the  set  of  observed  data  points.  An 
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alternative  is  to  treat  the  missing  values  as  nuisance  parameters  which  are  estimated  by 
least  squares.  This  would  correspond  to  the  intervention  analysis  approach  of  estimating 
missing  or  corrupted  data.  The  procedure  would  be  exactly  the  same  except  for  the 
definition  of  the  likelihood  in  which  det  A  ^  would  be  used  in  place  of  det  A,  where 
Ajj  is  the  sufcmatrix  of  A  corresponding  to  the  backforecasts  wL  above. 

If  there  is  a  large  consecutive  set  of  missing  values  then  it  is  again  possible  to 
make  economies  in  the  accumulation  of  the  elements  of  A,  but  this  is  not  worth  while  for 
irregularly  scattered  missing  data. 

Box-Cox  transformation  of  data.  Following  Box  and  Cox  (1964)  define 

f  (x^  -  1)/X  -  c  for  1  1  0 


w  ■ 


(,  log  x  -  c  for  X  »  0  . 

In  this  case  the  likelihood  is  multiplied  by  ITx^  1  and  the  deviance  is  redefined 
as  D  •  M1'^npG  ^  where  G  «  {Ilx^}  is  the  geometric  mean  of  the  data.  It  was 

pointed  out  by  Box  and  Cox  (1964)  that  if  the  data  is  first  scaled  by  G,  i.e.  xfc/G  is 
used  in  place  of  xfc,  then  the  geometric  mean  of  the  scaled  data  is  one,  and  the  Jacobian 
disappears  from  the  likelihood.  In  our  case  the  deviance  would  only  then  depend  on  X 
via  the  factor  Q.  We  do  not  use  this  device,  but  follow  its  implications.  Proceeding 
directly,  the  factor  G  ^  is  formally  retained  in  the  definition  of  D  and  all  its 
derivatives.  Including  those  wrt  X,  provided  we  use  3Q/9X  »  2 ( Z  a^aX^  -  I  b^bX^)  where 
e.g.  aX^  «  G^9(at/G^)/3X  =  Sa^/SX  -  a^log  G.  We  derive  Ba^/SX  as  xtBIwX^  where 
wX  ■»  3w/9X,  and  using  L  ”  log  x,  r  ■  XL,  e  “  exp(r)  ■  x*  and  w  ■  Ce  -  1)/X,  we  have 
wX  »  (Le  -  w)/X  for  X  a  0,  and  L2  for  X  «  o. 

A  similar  formula  is  used  in  constructing  the  approximation  to  the  Hessian,  so  that 

an  element  H8X,  where  8  is  any  other  parameter  is  formed  as  2(1  a®ta*t  -  t  b8fcbXt). 

2  2 

However,  when  the  element  HXX  is  formed  the  corresponding  expression  t  aX^  -  t  bX^ 
must  be  augnented  by  a  term 


vL  ■  Z  ataXXt  -  I  b^bXX^ 
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2  2  2 

where  e.g.  aXX^  •  3  a^/3X  -  a^dog  G)  .  This  supplies  the  exact  second  derivative  at 

the  minimum  of  p  wrt  X . 

The  point  here  ie  that  vL  does  not  become  negligible  for  large  n.  Indeed 
Brubacher  (1976)  shows  that  provided  wfc  is  Gaussian,  vL/n  ♦  Vardog  xfc)  >  0  at  the 
true  value  of  X.  It  is  therefore  recommended  that  whenever  vL  is  positive  it  is 
incorporated  in  HXX  throughout  the  iterations.  Besides  possibly  slowing  the  convergence 
of  the  iterates,  its  absence  will  lead  to  an  over  estimate  of  var(X)  from  the  Inverse 
Hessian. 

2  2  2  2  2 

Again  we  derive  3  a^/3X  as  »<B)wXX  where  now  wXX  *  3  w^/3X  «  (L  e  -  2wX)/X 

for  X  #  0  and  1/3L3  for  X  -  0.  It  is  important  to  avoid  numerical  instability  in  the 
region  close  to  X  -  0,  and  it  is  recommended  that  if  |r|  ”  |X  log  x|  is  small,  the 
transformation  and  its  derivatives  should  be  evaluated  using  the  first  two  or  three  terms 
in  the  series 

w  -  L(1  +  1/2r  +  1/6r2  +  •••) 
wX  -  L2<1/2  ♦  1/3r  +  1/8r2  ♦  •••> 
wXX  -  L3 ( 1/3  ♦  1/4r  ♦  1/10r2  +  •••)  . 

It  will  be  noted  that  the  factor  G  *  appears  in  both  the  first  derivatives 

of  D  and  in  the  approximations  to  the  second  derivatives,  so  that  it  may  be  omitted  from 
both  of  these  (in  the  same  way  as  was  omitted)  when  solving  the  equations  for  the 

parameter  corrections. 
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6.  IMPLEMENTATION 


The  foregoing  algorithm  has  been  implemented  and  the  numerical  valuee  obtained  for 
the  likelihood  in  many  examples  have  been  found  to  agree  to  machine  accuracy  with  the 
valuee  obtained  from  the  algorithm  of  Milard  (1983),  which  have  themeelvee  been  checked 
with  valuee  from  the  algorithme  of  Gardner  at  al.  (1980)  and  Analey  (1979).  No  comparieon 
of  actual  computation  timea  haa  been  made,  partly  becauae  not  all  the  poaaible  economiee 
in  confutation  have  yet  been  implemented,  but  moatly  becauae  the  algorithme  have  been 
embedded  in  a  library  routine  and  atatiatical  package,  both  deaigned  to  derive  the  maximum 
likelihood  (minimum  deviance)  eatimatea.  From  the  point  of  view  of  the  uaer  it  la  the 
numerical  accuracy  and  efficiency  in  locating  thaae  eatimatea  which  ia  of  prime 
importance.  Practical  experience  auggeats  that  for  nonaeaaonal  model*  and  for  seaaonal 
modela  with  period  12,  auch  aa  uaed  for  the  airline  data  in  Box  and  Jenkina,  the 
computational  requirements  are  moat  reaaonable,  being  of  the  order  5  seconds  on  a  modern 
machine,  for  the  10  to  15  iterations  required. 
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